
    Architectural techniques to extend multi-core performance scaling

    Multi-cores have successfully delivered performance improvements over the past decade; however, they now face problems on two fronts: power and off-chip memory bandwidth. Dennard's scaling is effectively coming to an end, which has led to a gradual increase in chip power dissipation. In addition, sustaining off-chip memory bandwidth has become harder due to the limited space for pins on the die and the greater current needed to drive the increasing load. My thesis focuses on techniques to address the power and off-chip memory bandwidth challenges in order to avoid the premature end of the multi-core era. In the first part of my thesis, I focus on techniques to address the power problem. One option to cope with the power limit, as suggested by some recent papers, is to ensure that an increasing number of cores are kept powered down (i.e., dark silicon) due to lack of power; but this option imposes a low upper bound on performance. The alternative option of customizing the cores to improve power efficiency may incur increased effort for hardware design, verification and test, and degraded programmability. I propose a gentler evolutionary path for multi-cores, called successive frequency unscaling (SFU), to cope with the slowing of Dennard's scaling. SFU keeps powered significantly more cores (compared to the option of keeping them 'dark') running at clock frequencies on the extended Pareto frontier that are successively lowered every generation to stay within the power budget. In the second part of my thesis, I focus on techniques to avert the limited off-chip memory bandwidth problem. Die-stacking of DRAM on a processor die promises to continue scaling the pin bandwidth to off-chip memory. While the die-stacked DRAM is expected to be used as a cache, storing any part of the tag in the DRAM itself erodes the bandwidth advantage of die-stacking. As such, the on-die space overhead of the large DRAM cache's tag is a concern. A well-known compromise is to employ a small on-die tag cache (T$) for the tag metadata while the full tag stays in the DRAM. However, tag caching fundamentally requires exploiting page-level metadata locality to ensure efficient use of the 3-D DRAM bandwidth. Plain sub-blocking exploits this locality but incurs holes in the cache (i.e., diminished DRAM cache capacity), whereas decoupled organizations avoid holes but destroy this locality. I propose the Bandwidth-Efficient Tag Access (BETA) DRAM cache (β), which avoids holes while exploiting the locality through various metadata organizational techniques. Using simulations, I conclusively show that the primary concern in DRAM caches is bandwidth and not latency, and that due to β's tag bandwidth efficiency, β with a T$ performs 15% better than the best previous scheme with a similarly-sized T$.
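
    The abstract argues that tag lookups themselves consume stacked-DRAM bandwidth. As a rough orientation (not from the thesis; block size, tag-probe cost, and hit rates below are assumptions), the following Python sketch shows why an on-die tag cache (T$) matters: every lookup whose tag metadata misses the T$ costs an extra DRAM access before the data can be fetched.

        # Illustrative model of DRAM-cache tag bandwidth; all parameters
        # are assumptions, not values from the thesis.
        BLOCK_BYTES = 64      # data moved per cache hit
        TAG_BEAT_BYTES = 64   # a tag probe still costs a full DRAM beat (assumed)

        def dram_bytes_per_hit(tag_cache_hit_rate: float) -> float:
            """Bytes moved over the stacked-DRAM channel per cache hit."""
            # Tag probes go to DRAM only when the on-die T$ misses.
            tag_traffic = TAG_BEAT_BYTES * (1.0 - tag_cache_hit_rate)
            return BLOCK_BYTES + tag_traffic

        for rate in (0.0, 0.5, 0.9):
            print(f"T$ hit rate {rate:.0%}: {dram_bytes_per_hit(rate):.0f} B/hit")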

    TimeTrader: Exploiting Latency Tail to Save Datacenter Energy for On-line Data-Intensive Applications

    Datacenters running on-line, data-intensive applications (OLDIs) consume significant amounts of energy. However, reducing their energy is challenging due to their tight response time requirements. A key aspect of OLDIs is that each user query goes to all or many of the nodes in the cluster, so that the overall time budget is dictated by the tail of the replies' latency distribution; replies see latency variations both in the network and compute. Previous work proposes to achieve load-proportional energy by slowing down the computation at lower datacenter loads based directly on response times (i.e., at lower loads, the proposal exploits the average slack in the time budget provisioned for the peak load). In contrast, we propose TimeTrader to reduce energy by exploiting the latency slack in the sub-critical replies which arrive before the deadline (e.g., 80% of replies are 3-4x faster than the tail). This slack is present at all loads and subsumes the previous work's load-related slack. While the previous work shifts the leaves' response time distribution to consume the slack at lower loads, TimeTrader reshapes the distribution at all loads by slowing down individual sub-critical nodes without increasing missed deadlines. TimeTrader exploits slack in both the network and compute budgets. Further, TimeTrader leverages Earliest Deadline First scheduling to largely decouple critical requests from the queuing delays of sub-critical requests, which can then be slowed down without hurting critical requests. A combination of real-system measurements and at-scale simulations shows that without adding to missed deadlines, TimeTrader saves 15-19% and 41-49% energy at 90% and 30% loading, respectively, in a datacenter with 512 nodes, whereas previous work saves 0% and 31-37%.
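
    As a rough sketch of the idea (not TimeTrader's actual controller; the budget, estimates, and safety margin below are assumptions), a leaf can compute per-request slack against the response-time budget and slow down only the sub-critical requests:

        # Minimal sketch of per-reply slack reclamation in the spirit of
        # TimeTrader. The budget, estimates, and the 0.5 safety margin are
        # illustrative assumptions, not values from the paper.
        BUDGET_MS = 10.0  # leaf response-time budget (assumed)

        def speed_factor(est_service_ms: float, network_delay_ms: float) -> float:
            """Return a clock factor in (0, 1]; 1.0 means full speed."""
            slack = BUDGET_MS - network_delay_ms - est_service_ms
            if slack <= 0:
                return 1.0  # critical reply: no slack to trade for energy
            # Stretch the service time into half the slack, keeping margin
            # so reshaping the distribution adds no missed deadlines.
            return est_service_ms / (est_service_ms + 0.5 * slack)

        print(speed_factor(2.0, 1.0))  # sub-critical: can run much slower
        print(speed_factor(9.5, 1.0))  # near the tail: stays at full speed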

    Statistical Wear Leveling for PCM: Protecting Against the Worst Case Without Hurting the Common Case

    Phase change memory (PCM) is emerging as a lead alternative to DRAM due to its good combination of speed, density, energy, and reliability. However, PCM can endure far fewer overwrites than DRAM before wearing out. PCM is susceptible to malicious or accidental overwrites which can wear out a frame in a few hundred seconds. Previous papers have proposed to periodically randomize the address-to-frame mapping in a memory region. Each randomization involves remapping the region's memory blocks, which incurs significant write overhead. To guarantee reasonable worst-case lifetimes, the papers assume that every write overwrites the same memory block and incur either high write overhead for normal applications (i.e., the common case), or permanent, high hardware overhead (i.e., in all cases). We make the key observation that the overwrite rates of normal applications (i.e., the common case) are orders of magnitude lower than that of the worst case. However, naively measuring the overwrite rate using brute-force hardware would incur significant complexity and power. Instead, we apply basic statistical sampling to estimate accurately the overwrite rate while requiring only a small sampling buffer. Our approach, called statistical wear leveling (SWL), randomizes the address-to-frame mapping on the basis of the estimated overwrite rates instead of write rates. SWL achieves both lower common-case write overhead and lower hardware overhead, and similar, high common-case lifetime as compared to the previous schemes, while achieving reasonable worst-case lifetime.
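
    A minimal sketch of the sampling idea follows (sampling probability, buffer size, and threshold are assumptions, not the paper's parameters): sample a small fraction of writes into a bounded buffer and treat repeats within the buffer as evidence of overwrites, triggering remapping only when the estimated overwrite rate is high.

        import random
        from collections import deque

        SAMPLE_PROB = 1 / 1024    # fraction of writes sampled (assumed)
        BUFFER_SIZE = 256         # small sampling buffer (assumed)
        REMAP_THRESHOLD = 0.10    # estimated overwrite rate triggering remap

        recent = deque(maxlen=BUFFER_SIZE)  # sampled write addresses
        sampled = 0
        overwrites = 0

        def on_write(block_addr: int) -> bool:
            """Return True when the region's mapping should be re-randomized."""
            global sampled, overwrites
            if random.random() >= SAMPLE_PROB:
                return False      # unsampled write: zero common-case cost
            sampled += 1
            if block_addr in recent:   # same block sampled again: overwrite
                overwrites += 1
            recent.append(block_addr)
            # Remap only once the estimate is trustworthy and high.
            return sampled >= 64 and overwrites / sampled > REMAP_THRESHOLD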

    Dark Silicon is Sub-Optimal and Avoidable

    Several recent papers argue that, due to the slowing down of Dennard's scaling of the supply voltage, future multicore performance will be limited by dark silicon, where an increasing number of cores are kept powered down due to lack of power. Customizing the cores to improve power efficiency may incur increased effort for hardware design, verification and test, and degraded programmability. In this paper, we show that dark silicon is sub-optimal in performance and avoidable, and that a gentler, evolutionary path for multicores exists. We make the key observations that (1) previous papers examine voltage-frequency-scaled designs on the power-performance Pareto frontier, whereas the frontier extends to a new region derived by frequency scaling alone where voltage-scaled designs are infeasible, and (2) because memory latency improves only slowly over generations, the performance of future multicores' workloads will be dominated by memory latency. Guided by these observations and a simple analytical model, we exploit (1) the sub-linear impact of clock speed on performance in the presence of memory latency, and (2) the super-linear impact of throughput on queuing delays. Accordingly, we propose an evolutionary path for multicores, called successive frequency unscaling (SFU). Compared to dark silicon, SFU keeps powered significantly more cores running at clock frequencies on the extended Pareto frontier that are successively lowered every generation to stay within the power budget. The higher active core count enables more memory-level parallelism, non-linearly offsetting the slower clock and resulting in more performance than that of dark silicon. For memory-intensive workloads, full SFU, where all the cores are powered up, performs 81% better than dark silicon at the 11 nm technology node. For enterprise workloads where both throughput and response times are important, we employ controlled SFU (C-SFU), which moderately slows down the clock and powers many, but not all, cores to achieve 29% better throughput than dark silicon at the 11 nm technology node. The higher throughput non-linearly reduces queuing delays and thereby compensates for the slower clock, resulting in C-SFU's total response latency being within +/- 10% of that of dark silicon.
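
    The core claim, that more cores at lower clocks beat fewer cores at full clock once memory latency dominates, can be seen in a toy model (every constant below is an illustrative assumption, not a value from the paper): at fixed voltage, dynamic power scales roughly linearly with frequency, so halving the clock roughly doubles the affordable core count, while memory stalls keep the per-core slowdown well under 2x.

        # Toy model of the SFU argument; every constant is an assumption.
        POWER_BUDGET = 100.0       # chip power budget (arbitrary units)
        POWER_PER_CORE_GHZ = 2.5   # per-core dynamic power per GHz (assumed)
        COMPUTE_CYCLES = 1.0e9     # core-bound cycles per work unit (assumed)
        MEM_STALL_SEC = 0.4        # memory-stall time per work unit (assumed)

        def throughput(freq_ghz: float) -> float:
            # Frequency scaling alone: affordable core count scales as 1/f.
            cores = POWER_BUDGET / (POWER_PER_CORE_GHZ * freq_ghz)
            # Memory stalls do not shrink with the clock, so per-core
            # performance falls sub-linearly as f drops.
            per_core = 1.0 / (COMPUTE_CYCLES / (freq_ghz * 1e9) + MEM_STALL_SEC)
            return cores * per_core

        for f in (4.0, 2.0, 1.0):  # successively unscaled clocks
            print(f"{f:.1f} GHz: {throughput(f):.1f} work-units/s")
        # Throughput rises from ~15 to ~29 work-units/s as the clock unscales.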

    MigrantStore: Leveraging Virtual Memory in DRAM-PCM Memory Architecture

    With the imminent slowing down of DRAM scaling, Phase Change Memory (PCM) is emerging as a lead alternative for main memory technology. While PCM achieves low energy due to various technology-specific advantages, PCM is significantly slower than DRAM (especially for writes) and can endure far fewer writes before wearing out. Previous work has proposed to use a large, DRAM-based hardware cache to absorb writes and provide faster access. However, due to ineffectual caching, where blocks are evicted before a sufficient number of accesses, hardware caches incur significant overheads in energy and bandwidth, two key but scarce resources in modern multicores. Because using hardware for detecting and removing such ineffectual caching would incur additional hardware cost and complexity, we leverage the OS virtual memory support for this purpose. We propose a DRAM-PCM hybrid memory architecture where the OS migrates pages on demand from the PCM to DRAM. We call the DRAM part of our memory MigrantStore, which includes two ideas. First, to reduce the energy, bandwidth, and wear overhead of ineffectual migrations, we propose migration hysteresis. Second, to reduce the software overhead of good replacement policies, we propose the recently-accessed-page-id (RAPid) buffer, a hardware buffer to track the addresses of recently-accessed MigrantStore pages.
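
    A minimal sketch of migration hysteresis and the RAPid buffer follows (the threshold and buffer size are assumptions; the paper's mechanisms live partly in the OS and partly in hardware): a page must prove itself with repeated PCM accesses before the OS pays the cost of migrating it to DRAM.

        from collections import Counter, deque

        MIGRATE_THRESHOLD = 4   # PCM touches before migration (assumed)
        RAPID_ENTRIES = 1024    # RAPid buffer capacity (assumed)

        pcm_touches = Counter()              # per-page counts while in PCM
        rapid = deque(maxlen=RAPID_ENTRIES)  # recently-accessed page ids
        in_dram = set()                      # pages resident in MigrantStore

        def on_access(page: int) -> None:
            if page in in_dram:
                rapid.append(page)   # cheap recency info for OS replacement
                return
            pcm_touches[page] += 1
            # Hysteresis: filter out ineffectual one-touch migrations.
            if pcm_touches[page] >= MIGRATE_THRESHOLD:
                in_dram.add(page)    # OS copies the page PCM -> DRAM
                del pcm_touches[page]
                rapid.append(page)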

    apSLIP: A High-performance Adaptive-Effort Pipelined Switch Allocator

    Switch allocation and queuing discipline have a first-order impact on network performance and hence overall system performance. Unfortunately, there is a fundamental tension between the quality of switch allocation and clock speed. On one hand, sophisticated switch allocators such as iSLIP include dependencies that make pipelining hard. On the other hand, simpler allocators which are pipelineable (and hence amenable to fast clocks) degrade throughput. This paper proposes apSLIP, which uses three novel ideas to adaptively pipeline iSLIP at fast clocks. To address the dependence between the grant and request stages in iSLIP, we allow superfluous requests to occur and leverage the VOQ architecture, which naturally makes the corresponding grants easy to use. To address the dependence between the reading and updating of priority counters in iSLIP, we use stale priority values and resolve the resulting double booking by privatizing the priority counters and separating the arbitration into odd and even streams. Further, we observe that while iSLIP can exploit multiple iterations to improve its matching strength, such additional iterations deepen the pipeline and add to the network latency. The improved matching strength helps high-load scenarios, whereas the increased latency hurts low-load cases. Therefore, we propose an adaptive-effort pipelined iSLIP, apSLIP, which adapts between one iteration (shallow pipeline) at low loads and two iterations (deep pipeline) at high loads. Simulations reveal that, compared to an aggressive 2-cycle router, apSLIP improves, on average, end-to-end packet latency in an 8x8 network by 43% and high-load application performance in a 3x3 network by 19% without affecting the low-load benchmarks.
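
    For orientation, here is a textbook single-iteration iSLIP step in the request-grant-accept style that apSLIP pipelines (a sketch of baseline iSLIP, not of apSLIP's odd/even privatized counters; the port count and example traffic are assumptions):

        # One request-grant-accept iteration of baseline iSLIP (sketch).
        N = 4                   # ports (assumed)
        grant_ptr = [0] * N     # per-output round-robin pointers
        accept_ptr = [0] * N    # per-input round-robin pointers

        def islip_iteration(requests):
            """requests[i] is the set of outputs input i has VOQ traffic for;
            returns a partial input -> output matching."""
            grants = {}
            for out in range(N):   # grant: pick requester nearest the pointer
                for k in range(N):
                    inp = (grant_ptr[out] + k) % N
                    if out in requests[inp]:
                        grants.setdefault(inp, []).append(out)
                        break
            match = {}
            for inp, outs in grants.items():  # accept: pick granter nearest ptr
                for k in range(N):
                    out = (accept_ptr[inp] + k) % N
                    if out in outs:
                        match[inp] = out
                        # iSLIP rule: pointers advance only on accepted grants,
                        # the read-update dependence that makes pipelining hard.
                        grant_ptr[out] = (inp + 1) % N
                        accept_ptr[inp] = (out + 1) % N
                        break
            return match

        print(islip_iteration([{0, 1}, {0}, {2}, set()]))  # e.g. {0: 0, 2: 2}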

    A longitudinal prospective study of septoplasty impact on headache and allergic rhinitis in patients with septal deviation

    Objective To measure the severity of allergic rhinitis (AR) and different types of headaches in patients with septal deviation before and after septoplasty. Methods This multicentre, prospective, longitudinal, observational study enrolled patients with a deviated nasal septum, nasal symptoms and headaches associated with persistent AR lasting at least 2 months without resolution. The nasal obstruction evaluation (NOSE) scale, immunoglobulin-E (Ig-E) levels and visual analogue scale (VAS) for headache pain severity were evaluated before and after septoplasty using the Wilcoxon signed-rank test. Results A total of 196 patients were enrolled in the study (102 males; 94 females). A total of 134 patients (68%) were diagnosed with severe AR and 166 (85%) experienced headaches with AR. The majority (100 of 166 patients; 60%) had sinusoidal headaches, while 25% (42 of 166 patients) reported a combination of sinusoidal headache and migraine and 14% (24 of 166 patients) experienced migraines. A comparison of preoperative and postoperative Ig-E levels, NOSE and VAS scores demonstrated that septoplasty significantly improved AR symptoms and headaches. Although there were significant improvements in headaches overall post-septoplasty, only the sinusoidal components improved, while migraine remained unaffected. Conclusion Septoplasty improved AR and sinusoidal headaches in patients with septal deviation, but migraines remained unaffected.
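
    The paired pre/post comparison described above maps directly onto a standard Wilcoxon signed-rank test; a minimal sketch follows (the scores below are made-up placeholders to show the shape of the analysis, not the study's data):

        from scipy.stats import wilcoxon

        # Placeholder paired scores, NOT the study's data: NOSE scale for
        # the same patients before and after septoplasty.
        nose_pre = [85, 70, 90, 65, 80, 75, 95, 60]
        nose_post = [30, 25, 40, 20, 35, 30, 45, 15]

        stat, p = wilcoxon(nose_pre, nose_post)
        print(f"Wilcoxon statistic = {stat}, p = {p:.4f}")  # small p: significant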

    sj-pdf-2-imr-10.1177_03000605231215168 - Supplemental material for A longitudinal prospective study of septoplasty impact on headache and allergic rhinitis in patients with septal deviation

    Supplemental material, sj-pdf-2-imr-10.1177_03000605231215168 for A longitudinal prospective study of septoplasty impact on headache and allergic rhinitis in patients with septal deviation by Shanila Feroz, Muhammad Hamza Dawood, Sheza Sohail, Muhammad Daniyal, Ayesha Zafar, Ukashah Bin Shahid and Shamim Ahmed in Journal of International Medical Research.

    sj-pdf-1-imr-10.1177_03000605231215168 - Supplemental material for A longitudinal prospective study of septoplasty impact on headache and allergic rhinitis in patients with septal deviation

    Supplemental material, sj-pdf-1-imr-10.1177_03000605231215168 for A longitudinal prospective study of septoplasty impact on headache and allergic rhinitis in patients with septal deviation by Shanila Feroz, Muhammad Hamza Dawood, Sheza Sohail, Muhammad Daniyal, Ayesha Zafar, Ukashah Bin Shahid and Shamim Ahmed in Journal of International Medical Research.